We are trying to determine which products have a high association, i.e. those that are likely to be bought together
We will implement this using Market Basket Analysis (MBA), which relies on Association Rule Mining (here, the Apriori algorithm)
Association rules are if-then statements that show the probability of relationships between data items within large data sets
A rule has two parts: an antecedent (the "if" itemset) and a consequent (the "then" itemset)
An association rule for items \(M_1\) (antecedent) and \(M_2\) (consequent) can then be expressed as:
\(M_1 \rightarrow M_2\), i.e. itemsets that contain \(M_1\) also tend to contain \(M_2\)
Note that the implication here is co-occurrence, not causality
There are three metrics that help us understand the strength of the relationship (rule):
Support - indicates how frequently the itemset occurs in the data set
Confidence - how often the rule is found to be true, i.e. the likelihood of the consequent occurring given that the antecedent has occurred
Lift - the ratio of the observed confidence to the confidence expected if the antecedent and consequent were independent
Lift(\(M_1 \rightarrow M_2\)) = \(\frac{\text{confidence}(M_1 \rightarrow M_2)}{\text{support}(M_2)}\) = \(\frac{\frac{\text{\# transactions containing $M_1$ and $M_2$}}{\text{\# transactions containing $M_1$}}}{\frac{\text{\# transactions containing $M_2$}}{\text{\# transactions}}}\)
In cases where \(M_1\) actually makes \(M_2\) more likely, the value of lift will be greater than 1; a lift of 1 means the two items are independent
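As a toy illustration (hypothetical counts, not from the dataset below), the three metrics can be computed by hand in base R:

```r
# Toy example: 100 transactions; 20 contain M1, 25 contain M2, 10 contain both
n   <- 100
n1  <- 20   # transactions containing M1
n2  <- 25   # transactions containing M2
n12 <- 10   # transactions containing both M1 and M2

support    <- n12 / n                 # support(M1 -> M2)    = 0.10
confidence <- n12 / n1                # confidence(M1 -> M2) = 0.50
lift       <- confidence / (n2 / n)   # lift = 0.50 / 0.25   = 2

lift  # > 1, so M1 and M2 co-occur more often than expected by chance
```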
The general workflow is:
library(readr)
file_path <- "Import files/Market_Basket_Optimisation.csv"
data <- read_csv(file_path, col_names = FALSE)
Each row represents the transactions of an individual customer
We convert the data frame to a sparse matrix (an object of class transactions)
The sparse matrix is a matrix of 0s and 1s, with each row representing a transaction and each column representing a product
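As a minimal sketch of that representation (toy baskets, not the actual file), the incidence matrix underlying a transactions object can be built by hand in base R:

```r
# Toy baskets (hypothetical items)
baskets <- list(c("milk", "bread"),
                c("milk", "eggs"),
                c("bread", "eggs", "milk"))

items <- sort(unique(unlist(baskets)))

# One row per basket, one column per item; 1 = item present in that basket
incidence <- t(sapply(baskets, function(b) as.integer(items %in% b)))
colnames(incidence) <- items
incidence
```

In practice the matrix is stored sparsely (only the positions of the 1s), which is what makes 7501 transactions over 119 items cheap to hold in memory.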
library(arules)
dataset <- read.transactions(file_path, sep = ",", rm.duplicates = TRUE )
## distribution of transactions with duplicates:
## 1
## 5
summary(dataset)
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03288973
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1788 1348 1306 1282 1229
## (Other)
## 22405
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17 4
## 18 19 20
## 1 2 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.914 5.000 20.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
library(RColorBrewer)
itemFrequencyPlot(dataset,
topN=20,
col=brewer.pal(8,'Pastel2'),
main='Relative Item Frequency Plot',
type="relative",
ylab="Item Frequency (Relative)")
The basic steps in implementing the Apriori algorithm are as follows:
Set up a minimum support and confidence
Take all the subsets in transactions having higher support than the minimum support
Take all the subsets in transactions having higher confidence than the minimum confidence
Sort the rules by decreasing lift
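The filtering and sorting steps above can be sketched in base R on toy baskets (single-antecedent, single-consequent rules only; the real algorithm prunes candidate itemsets far more cleverly):

```r
# Toy baskets (hypothetical items)
baskets <- list(c("milk", "bread"),
                c("milk", "eggs"),
                c("bread", "eggs", "milk"),
                c("milk", "bread"))

items <- sort(unique(unlist(baskets)))
# Support of an itemset = share of baskets containing every item in it
supp <- function(set) mean(sapply(baskets, function(b) all(set %in% b)))

min_support <- 0.5; min_confidence <- 0.6

# Enumerate all 1-item -> 1-item rules with their metrics
rules <- do.call(rbind, lapply(items, function(ante) {
  do.call(rbind, lapply(setdiff(items, ante), function(cons) {
    s <- supp(c(ante, cons))
    data.frame(lhs = ante, rhs = cons,
               support    = s,
               confidence = s / supp(ante),
               lift       = (s / supp(ante)) / supp(cons))
  }))
}))

# Keep rules above both thresholds, then sort by decreasing lift
rules <- rules[rules$support >= min_support &
               rules$confidence >= min_confidence, ]
rules[order(-rules$lift), ]
```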
The choice of support (how frequently the itemset appears in the data set) and confidence (how often the rule holds) varies by business case: it depends on the goal, the size of the data, etc.
For minimum support, we want products that are bought at least twice a day over the week of data, i.e. 2 * 7 / nrow(dataset)
minlen specifies the minimum number of products you'd like to have in your rule (optional)
maxlen specifies the maximum number of products you'd like to have in your rule (optional)
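As a quick sanity check on that arithmetic (assuming the 7501 transactions cover roughly one week):

```r
n_transactions <- 7501                   # nrow(dataset)
min_support <- 2 * 7 / n_transactions    # bought ~2x/day over 7 days

round(min_support, 6)                    # relative support threshold
round(min_support * n_transactions)      # absolute count threshold: 14
```

This matches the "Absolute minimum support count: 14" line in the apriori() output below.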
rules <- apriori(dataset,
parameter = list(support = 14/nrow(dataset),
confidence = .2,
minlen = 2,
maxlen = 20
)
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.001866418 2
## maxlen target ext
## 20 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 14
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [3193 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Removing Redundant Rules
#rules <- rules[!is.redundant(rules)]
subset.rules <- which(colSums(is.subset(rules, rules)) > 1) # get subset rules in vector
rules <- rules[-subset.rules] # remove subset rules.
library(dplyr)
rules_df <- DATAFRAME(rules) %>% arrange(desc(lift))
DT::datatable(rules_df)
library(arulesViz)
plot(rules,jitter = 0, engine = "plotly")
Rules with high confidence tend to have low support, and vice versa
Rules with high lift tend to have relatively low support
plot(rules, method = "two-key plot")
subrules <- head(rules, n = 30, by = "lift") # top 30 rules by lift
#plot(subrules, method = "graph", engine = "htmlwidget")
plot(subrules, method = "graph",
control = list(
# edges = ggraph::geom_edge_link(
# end_cap = ggraph::circle(4, "mm"),
# start_cap = ggraph::circle(4, "mm"),
# color = "black",
# arrow = arrow(length = unit(2, "mm"), angle = 20, type = "closed"),
# alpha = .2
# ),
nodes = ggraph::geom_node_point(aes_string(size = "support", color = "lift"))
#nodetext = ggraph::geom_node_label(aes_string(label = "label"), alpha = .8, repel = TRUE)
)
) +
scale_color_gradient(low = "dodgerblue", high = "red") +
scale_size(range = c(2, 10))
# Shiny App for Interactive Manipulations and Visualization
#ruleExplorer(subrules, sidebarWidth = 2, graphHeight = '600px')